# import necessary packages
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import community
import nltk
from nltk.corpus import stopwords
# download stopwords and word tokenizer from NLTK
nltk.download('stopwords')
nltk.download('punkt')
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import netwulf as nw
[nltk_data] Downloading package stopwords to /Users/danyu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/danyu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
In the era of digital music, the rise of online music streaming platforms has brought significant changes to the music industry, including the way music is consumed, produced, and reviewed. As a result, large amounts of data are now available for analysis, offering valuable insights into the preferences, behaviors, and opinions of music listeners. This project aims to analyze the social network constructed using a subset of Vinyl records from the Amazon Review Data (2018) and to uncover the genre of music that the reviews are based on. Vinyl records are chosen specifically, as it is a relatively old medium and it would provide an interesting lens to examine how people perceive this format in today's world.
Furthermore, based on how the medium and its accessories are commonly advertised, a hypothesis is made that most vinyl collectors are interested in the rock genre. It is therefore interesting to explore whether this hypothesis can be confirmed, or whether other genres are more pervasive in the medium of vinyl records.
This can help producers and sellers target their marketing strategies better, appealing to the target audience and improving sales.
The goal for the end user is to understand how the analysis is carried out in order to follow the investigation of the project's main problem.
The Amazon dataset was chosen because it contains a large number of reviews of vinyl records that are easily available and filtered. More details about the dataset are given in the next section.
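The raw Amazon Review Data files are distributed as gzip-compressed JSON lines, which is why `gzip` is imported above. If the file has not been decompressed beforehand, it can be streamed directly; a minimal sketch (the filename `Digital_Music.json.gz` is an assumption based on the dataset's naming convention):

```python
import gzip
import json

def load_jsonl_gz(path):
    """Stream a gzip-compressed JSON-lines file into a list of dicts."""
    records = []
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            records.append(json.loads(line))
    return records

# data_music = load_jsonl_gz('Digital_Music.json.gz')
```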
# load the downloaded raw data
data_music = []
with open('Digital_Music.json', 'r') as f:
    for line in f:
        data_music.append(json.loads(line))
# total length of the list, i.e. the total number of reviews
print(f'Total number of reviews in Digital Music (5-core) is {len(data_music)}')
print('Here is an example of raw data')
# first row of the list
print(data_music[0])
Total number of reviews in Digital Music (5-core) is 1584082
Here is an example of raw data
{'overall': 5.0, 'verified': True, 'reviewTime': '12 22, 2013', 'reviewerID': 'A1ZCPG3D3HGRSS', 'asin': '0001388703', 'style': {'Format:': ' Audio CD'}, 'reviewerName': 'mark l. massey', 'reviewText': 'This is a great cd full of worship favorites!! All time great Keith green songs. His best album by far.', 'summary': 'Great worship cd', 'unixReviewTime': 1387670400}
# convert raw data into pandas dataframe
def getDF(data):
    i = 0
    df = {}
    for d in data:
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')
df = getDF(data_music)
print(f'The digital music (5-core) dataframe contains the following variables: {df.columns}')
The digital music (5-core) dataframe contains the following variables: Index(['overall', 'verified', 'reviewTime', 'reviewerID', 'asin', 'style',
'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'vote',
'image'],
dtype='object')
Data preprocessing was done in the following steps:
The key 'Format:' of the 'style' variable in the Digital Music (5-core) data holds the product type, and it is used to extract a vinyl records subset from the original data set. Reviews where the 'Format:' key of the 'style' variable is missing are filtered out first, since the product type cannot be determined without it.
# Remove rows where the 'style' column is empty or NaN
df_filtered = df.dropna(subset=['style'])
# Filter out rows where the 'Format' key does not exist
df_filtered = df_filtered[[isinstance(d, dict) and 'Format:' in d for d in df_filtered['style']]]
The number of reviews for each product type is computed to see the distribution, and vinyl records turn out to be the product type with the third-most reviews in the original data. A vinyl records subset of the original data is then constructed by keeping only the reviews for vinyl records, since those are the reviews of interest. A further reason for choosing vinyl records is that this subset of Digital Music (5-core) is of a size that is feasible for the analysis in the rest of the project.
# Extract the values of the inner dictionaries
styles = [d['Format:'] for d in df_filtered['style']]
# Count the occurrences of each style
style_counts = Counter(styles)
print('The following is the product type distribution of the Digital Music (5-core) data printed in order from the product type receiving the most reviews')
print(style_counts)
The following is the product type distribution of the Digital Music (5-core) data printed in order from the product type receiving the most reviews
Counter({' MP3 Music': 993745, ' Audio CD': 286917, ' Vinyl': 27391, ' DVD': 923, ' Audio Cassette': 821, ' DVD Audio': 288, ' Amazon Video': 171, ' Blu-ray Audio': 162, ' Blu-ray': 143, ' Paperback': 70, ' Hardcover': 51, ' Vinyl Bound': 17, ' Kindle Edition': 13, ' USB Memory Stick': 12, ' CD-ROM': 11, ' DVD-ROM': 7, ' Health and Beauty': 7, ' Unknown Binding': 7, ' Accessory': 7, ' VHS Tape': 6, ' Spiral-bound': 6, ' Kitchen': 5, ' Mass Market Paperback': 5, ' Personal Computers': 3, ' Apparel': 3, ' Prime Video': 3, ' Video CD': 3, ' Grocery': 2, ' Office Product': 2, ' MP3 CD': 2, ' Audible Audiobook': 1, ' Laser Disc': 1, ' Home': 1, ' Perfect Paperback': 1, ' Unbound': 1, ' CD Video': 1, ' Misc. Supplies': 1, ' CD-R': 1, ' Calendar': 1})
# Construct the vinyl records subset
style_to_keep = ' Vinyl'
df_filtered_Vinyl = df_filtered[df_filtered['style'].apply(lambda x: x.get('Format:') == style_to_keep)]
This is done since we are only interested in utilizing the 5 features mentioned above.
df_filtered_Vinyl = df_filtered_Vinyl.drop(columns=['verified', 'reviewTime', 'summary', 'unixReviewTime', 'vote', 'image', 'style'])
It is observed that the data contains duplicates and they should be removed. Also, data without review text are also removed since the review text is necessary for conducting analysis later on.
# drop duplicates in data
df_filtered_Vinyl = df_filtered_Vinyl.drop_duplicates()
# drop all rows where the "reviewText" column is NaN
df_filtered_Vinyl = df_filtered_Vinyl.dropna(subset=['reviewText'])
This is done in order to construct a network in which every node is connected to at least two other nodes: since every remaining product has at least 3 reviewers, each reviewer of such a product is linked to at least two co-reviewers.
num_rows = df_filtered_Vinyl.shape[0]
# Group the dataframe by asin and create a set of reviewerIDs for each asin
grouped_total = df_filtered_Vinyl.groupby('asin')['reviewerID'].apply(list)
grouped = grouped_total
# removing products with fewer than 3 reviews
productID_one_review = []
for productID, reviewer in grouped.items():
    if len(reviewer) < 3:
        productID_one_review.append(productID)
grouped = grouped.drop(productID_one_review)
df_filtered_Vinyl_3 = df_filtered_Vinyl[~df_filtered_Vinyl['asin'].isin(productID_one_review)]
num_rows_3 = df_filtered_Vinyl_3.shape[0]
print(f'Number of reviews for vinyl records before removing products with less than 3 reviews: {num_rows}')
print(f'number of reviews for vinyl records after removing products with less than 3 reviews: {num_rows_3}')
print(f'Number of reviewed vinyl records before removing products with less than 3 reviews: {len(grouped_total)}')
print(f'Number of reviewed vinyl records after removing products with less than 3 reviews: {len(grouped)}')
Number of reviews for vinyl records before removing products with less than 3 reviews: 27118
number of reviews for vinyl records after removing products with less than 3 reviews: 14822
Number of reviewed vinyl records before removing products with less than 3 reviews: 12192
Number of reviewed vinyl records after removing products with less than 3 reviews: 1848
In order to investigate how the data is distributed in the vinyl records subset, the following four distributions are plotted:
number_of_reviews_per_product = []
for product in grouped:
    number_of_reviews_per_product.append(len(product))
# plot a histogram of the values
plt.hist(number_of_reviews_per_product, bins = 20, log=True)
#plt.xlim(0, 300)
# set the plot title and axis labels
plt.title('The distribution of the number of reviews per product')
plt.xlabel('The number of reviews per product')
plt.ylabel('Number of products')
# show the plot
plt.show()
# Group the reviews by reviewerID
reviewer_groups = df_filtered_Vinyl_3.groupby('reviewerID')
# Iterate over the groups and create a list of productIDs for each reviewer
product_lists = {}
for reviewerID, group in reviewer_groups:
    product_list = list(set(group['asin']))
    product_lists[reviewerID] = product_list
number_of_reviews_per_person = []
for products in product_lists.values():
    number_of_reviews_per_person.append(len(products))
# plot a histogram of the values
plt.hist(number_of_reviews_per_person, bins = 20, log=True)
#plt.xlim(0, 300)
# set the plot title and axis labels
plt.title('The distribution of the number of reviews per person')
plt.xlabel('The number of reviews per person')
plt.ylabel('Number of people')
# show the plot
plt.show()
length_of_reviews = []
for review in df_filtered_Vinyl_3['reviewText']:
    length_of_reviews.append(len(review))
arr = np.array(length_of_reviews)
_, bins = np.histogram(np.log10(arr + 1), bins='auto')
# plot a histogram of the values
plt.hist(length_of_reviews, bins = 10**bins, log=True)
plt.gca().set_xscale("log")
#plt.xlim(0, 300)
# set the plot title and axis labels
plt.title('The distribution of the length of reviews')
plt.xlabel('Length of reviews')
plt.ylabel('Number of reviews')
# show the plot
plt.show()
# Group the reviews by reviewerID and calculate mean rating
reviewer_rank = df_filtered_Vinyl_3.groupby('reviewerID')['overall'].mean()
# Convert the result to a dictionary
reviewer_rank_dict = dict(reviewer_rank)
count_rank = Counter(reviewer_rank_dict.values())
plt.title('The distribution of average rating per person')
plt.bar(count_rank.keys(), count_rank.values(), width = 0.3, log = True)
plt.xlabel('Average rate')
plt.ylabel('Number of people')
# show the plot
plt.show()
Construct a social network from the vinyl records subset, where reviewers are nodes and a link with weight x indicates that two reviewers have reviewed x of the same products. That is, an undirected weighted graph is constructed.
# Create an empty dictionary to store the weighted edge list
edge_list = {}
# Iterate through each group of reviewerIDs
for reviewerIDs in grouped.values:
    # Generate unique pairs of reviewerIDs
    pairs = [(reviewerIDs[i], reviewerIDs[j]) for i in range(len(reviewerIDs)) for j in range(i + 1, len(reviewerIDs))]
    # Update the dictionary with the number of products they have both reviewed
    for pair in pairs:
        key = tuple(sorted(pair))
        if key in edge_list:
            edge_list[key] += 1
        else:
            edge_list[key] = 1
# Convert the dictionary to a list of tuples, where the first two entries are reviewerIDs and the last is the
# number of products that the two reviewers have both reviewed
weighted_edge_list = [(key[0], key[1], value) for key, value in edge_list.items()]
# Create an empty undirected graph
G = nx.Graph()
# Add weighted edges from the edge list to the graph
G.add_weighted_edges_from(weighted_edge_list)
The following node attributes are added to the graph:
# Create a dictionary mapping reviewerID to reviewerName
name_dict = dict(zip(df_filtered_Vinyl_3['reviewerID'], df_filtered_Vinyl_3['reviewerName']))
# Add the 'reviewerName' and 'products_reviewed' attribute to each node in G
for node in G.nodes():
    G.nodes[node]['reviewerName'] = name_dict.get(node)
    if node in product_lists:
        G.nodes[node]['products_reviewed'] = len(product_lists[node])
    else:
        G.nodes[node]['products_reviewed'] = 0
# Add the 'avg_rank' attribute to each node
nx.set_node_attributes(G, reviewer_rank_dict, 'avg_rank')
# Number of nodes
num_nodes = G.number_of_nodes()
# Number of links
num_links = G.number_of_edges()
print("Number of nodes: ", num_nodes)
print("Number of links: ", num_links)
Number of nodes:  13170
Number of links:  246899
# Compute the degree for each node
degrees = dict(G.degree())
# Sort the nodes by degree in descending order and extract the top 5
top_reviewers = sorted(degrees.items(), key=lambda x: x[1], reverse=True)[:5]
# Print the attributes for the top 5 reviewers by degree
print("Top 5 reviewers by degree:")
for reviewer_id, degree in top_reviewers:
    reviewer_name = G.nodes[reviewer_id]['reviewerName']
    product_ids = G.nodes[reviewer_id]['products_reviewed']
    avg_rank = G.nodes[reviewer_id]['avg_rank']
    print(f"Reviewer ID: {reviewer_id}, Reviewer Name: {reviewer_name}, Degrees: {degree}, Avg Rating: {avg_rank}, Product IDs: {product_ids}")
Top 5 reviewers by degree:
Reviewer ID: ARHJOL4GRDB95, Reviewer Name: Amazon Customer, Degrees: 523, Avg Rating: 4.333333333333333, Product IDs: 6
Reviewer ID: A2R2KWQNIG07O, Reviewer Name: falloutromance, Degrees: 469, Avg Rating: 4.666666666666667, Product IDs: 3
Reviewer ID: AMKSCB1X545PN, Reviewer Name: martin russell, Degrees: 464, Avg Rating: 5.0, Product IDs: 3
Reviewer ID: A1S06O7AYKIWV, Reviewer Name: 33rpm, Degrees: 440, Avg Rating: 5.0, Product IDs: 4
Reviewer ID: A38H7D8U24ECFK, Reviewer Name: Robert Levoy, Degrees: 422, Avg Rating: 5.0, Product IDs: 4
# Get the degree of each node
degrees = [G.degree(n) for n in G.nodes()]
# Calculate the median of the degree sequence
median_degree = np.median(degrees)
mean_degree = np.mean(degrees)
min_degree = np.min(degrees)
max_degree = np.max(degrees)
print('Degree information')
print(f'Median degree: {median_degree}')
print(f'Mean degree: {mean_degree}')
print(f'Min degree: {min_degree}')
print(f'Max degree: {max_degree}')
Degree information
Median degree: 13.0
Mean degree: 37.49415337889142
Min degree: 1
Max degree: 523
#Plot the degree distribution
plt.hist(degrees, bins=30, log = True)
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.title('Degree Distribution')
plt.show()
The vinyl records subset is obtained by filtering the Digital Music (5-core) data set, which is 37.8 MB. The filtered vinyl records subset, where each product has received 3 or more reviews, contains 14,822 reviews covering 1,848 products in total.
Looking at the distributions plotted above, most products receive under 25 reviews each, and only a few products have received more than 75 reviews. In general, the more reviews a product has received, the fewer products there are with that many reviews. Similarly, most reviewers write fewer than 5 reviews, and the maximum number of reviews written by a single person in the data is just above 70. The length of reviews is approximately normally distributed with a mean of around 300 characters.
By looking at the distribution of average ratings per person, it can be seen that most reviewers tend to give high ratings, with 5 being the most common rating. However, there is also a noticeable amount of low ratings. Interestingly, among the lower scores, rating 1 is more common than ratings 2 and 3. This could suggest that people who liked the product tend to give it a higher rating, while those who did not like it simply give it the lowest rating possible. It is also worth noting that whole-number averages are more common than fractional ones, since most people write only one review, which then becomes their average rating.
An undirected weighted social network is constructed from the filtered vinyl records subset, where reviewers are nodes and a link with weight x indicates that two reviewers have reviewed x of the same products. The network contains 13,170 nodes and 246,899 links in total. The top 5 reviewers by degree are found and printed above; these are the people who have reviewed the most products in common with other people in the data set.
Lastly, the network degree is investigated. The median degree is 13, the mean degree is 37.49, the minimum degree is 1 whereas the maximum degree is 523. The degree distribution shows that most of the nodes have a degree lower than 200.
This section is divided into subsections according to the network science tools used in the analysis. Each of the first two subsections starts with a short theory paragraph and ends with a short conclusion. The community subsection starts with a short theory paragraph, and short conclusions are drawn along the way.
In social network analysis, degree centrality, betweenness centrality, and closeness centrality are three common measures used to identify and quantify the importance of nodes in a network.
Degree centrality measures the fraction of nodes in the network that a node is connected to. A node with a high degree centrality is often considered important because it has a large number of connections and is therefore well connected to other nodes in the network.
Betweenness centrality of a node is a measurement of how often a node appears on the shortest path between two other nodes since it is the sum of the fraction of all pairs of shortest paths that pass through the node. A high betweenness centrality indicates that the node has a lot of influence over the flow of information in a graph.
Closeness centrality measures the efficiency of a node to spread information through the network. A node with a high closeness centrality is often considered important because it is located close to many other nodes in the network and can quickly spread information to them.
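As a toy illustration of the three measures (separate from the review network), consider a path graph on four nodes, where the two interior nodes score highest on all three:

```python
import networkx as nx

# A path graph 0-1-2-3: the interior nodes (1 and 2) are the best connected
P = nx.path_graph(4)

deg = nx.degree_centrality(P)       # degree / (n - 1)
btw = nx.betweenness_centrality(P)  # fraction of shortest paths through the node
clo = nx.closeness_centrality(P)    # inverse average distance to the other nodes

print(deg[1], btw[1], clo[1])
```

Node 1 scores 2/3 on degree and betweenness centrality and 0.75 on closeness centrality, while the end nodes score lower on all three.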
The top 3 nodes by degree centrality, betweenness centrality, and closeness centrality are found below.
# Calculate centrality measures
degree_centrality = nx.degree_centrality(G)
betweenness_centrality = nx.betweenness_centrality(G)
closeness_centrality = nx.closeness_centrality(G)
# Print the top 3 nodes by each centrality measure
print('Top 3 nodes by degree centrality:')
for node, centrality in sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:3]:
    print(node, centrality)
print('Top 3 nodes by betweenness centrality:')
for node, centrality in sorted(betweenness_centrality.items(), key=lambda x: x[1], reverse=True)[:3]:
    print(node, centrality)
print('Top 3 nodes by closeness centrality:')
for node, centrality in sorted(closeness_centrality.items(), key=lambda x: x[1], reverse=True)[:3]:
    print(node, centrality)
Top 3 nodes by degree centrality:
ARHJOL4GRDB95 0.03971448097805452
A2R2KWQNIG07O 0.03561394183309287
AMKSCB1X545PN 0.03523426228263346
Top 3 nodes by betweenness centrality:
A1GGOC9PVDXW7Z 0.04757637422861338
ARHJOL4GRDB95 0.03590701410501194
A38H7D8U24ECFK 0.03033025637226728
Top 3 nodes by closeness centrality:
ARHJOL4GRDB95 0.23198999123634903
A38H7D8U24ECFK 0.22268825988125707
A163GYGV123WKV 0.22199765847574685
Above, the top 3 nodes by degree centrality, betweenness centrality, and closeness centrality are found. All the centrality values are relatively low, indicating that no single node is much more important to the network than the others, although the nodes listed above are still the most important nodes within the network.
Assortativity is a measure of the preference of a node in a network to be attached to others that are similar to themselves in some way.
The assortativity coefficient lies between -1 and 1. If nodes tend to be connected to other nodes with similar attribute values, the assortativity coefficient is close to 1. If nodes tend to be connected to other nodes with different attribute values, the assortativity coefficient is close to -1. If there is no correlation between the values of the chosen node attribute for neighboring nodes, the assortativity coefficient is close to 0.
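As a toy illustration of the two extremes (separate from the review network): a star graph is perfectly disassortative by degree, since the hub only connects to leaves, while a disjoint union of two complete graphs of different sizes is perfectly assortative, since every edge joins two equal-degree nodes:

```python
import networkx as nx

# Star graph: the hub (degree 3) only connects to leaves (degree 1)
star = nx.star_graph(3)
r_star = nx.degree_assortativity_coefficient(star)  # -1.0

# Two separate cliques of different sizes: every edge joins equal-degree nodes
cliques = nx.disjoint_union(nx.complete_graph(3), nx.complete_graph(2))
r_cliques = nx.degree_assortativity_coefficient(cliques)  # 1.0
```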
Here, we calculate degree assortativity, average rating assortativity, and the number of products reviewed assortativity.
Degree assortativity refers to the tendency of high-degree nodes in a network to be connected to other high-degree nodes, and similarly for low-degree nodes. Average rating assortativity likewise measures the tendency of nodes with a high average rating to be connected to other nodes with a high average rating, and products-reviewed assortativity measures the same tendency for the number of products reviewed.
# Compute the assortativity coefficient for degree
r = nx.degree_assortativity_coefficient(G)
print(f"Assortativity coefficient for degree: {r:.4f}")
r1 = nx.attribute_assortativity_coefficient(G, 'avg_rank')
print(f"Assortativity coefficient for average rating: {r1:.4f}")
r2 = nx.attribute_assortativity_coefficient(G, 'products_reviewed')
print(f"Assortativity coefficient for number of products reviewed: {r2:.4f}")
Assortativity coefficient for degree: 0.7497
Assortativity coefficient for average rating: 0.0350
Assortativity coefficient for number of products reviewed: 0.0039
The assortativity coefficient for degree is 0.75, showing that the network is assortative: nodes with high degrees tend to be connected to other nodes with high degrees. The coefficients found for the other two attributes, average rating and number of products reviewed, are both very close to 0, indicating no significant relation between connected nodes in terms of average rating or number of products reviewed.
In social network analysis, communities refer to groups of nodes within a network that are densely connected internally but relatively sparsely connected to nodes in other parts of the network. These groups are identified using the Louvain algorithm (via the python-louvain package), which aims to partition the network into groups of nodes that are more densely connected to each other than to nodes in other groups. Additionally, the weighted modularity is computed.
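Modularity compares the fraction of links inside communities against the fraction expected if links were placed at random. As a toy cross-check (using networkx's own modularity function rather than python-louvain), consider two 5-cliques joined by a single edge, partitioned into one community per clique:

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Two complete graphs of 5 nodes joined by one edge (a "barbell")
B = nx.barbell_graph(5, 0)

# The natural partition: one community per clique
parts = [set(range(5)), set(range(5, 10))]
q = modularity(B, parts)
```

With 21 edges in total, each community holds 10 internal edges and half the total degree, giving q = 2 * (10/21 - (1/2)^2) = 19/42, a clearly positive modularity for this partition.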
# Compute the community using the Louvain algorithm
partition = community.best_partition(G, weight='weight')
# Compute the weighted modularity
modularity = community.modularity(partition, G, weight='weight')
print('Weighted modularity:', modularity)
Weighted modularity: 0.8981555026022325
The weighted modularity of 0.898 indicates a strong community structure in the network, where nodes tend to be more strongly connected to others within their community than to nodes in other communities. This is a high value, suggesting that the network is highly modular and that the partition captures the community structure well.
# Store the communities in a dictionary, where keys are nodes and values are the community the node belongs to
c_attributes = {}
for node, comm_id in partition.items():
    c_attributes[node] = comm_id
# Get the largest connected components in G and store it into a graph
largest_cc = max(nx.connected_components(G), key=len)
largest_cc_graph = G.subgraph(largest_cc)
# Dictionary where keys are the communities and values are the number of nodes that the community has
count = Counter(c_attributes.values())
# Find all communities with fewer than 310 nodes such that only the top 5 communities are colored
com_20 = []
for comm_id, n_nodes in count.items():
    if n_nodes < 310:
        com_20.append(comm_id)
# Modify the community dictionary such that communities with fewer than 310 nodes get the same white color
#for node, comm_id in c_attributes.items():
#    if comm_id in com_20:
#        c_attributes[node] = '#FFFFFF'
# Add the community id as the 'group' attribute to each node
nx.set_node_attributes(largest_cc_graph, c_attributes, 'group')
# Plot the network
nw.interactive.visualize(largest_cc_graph)
(None, None)
[Two netwulf visualizations of the largest connected component: left, every community colored uniquely; right, only the 5 largest communities colored.]
The presence of densely connected groups of nodes in the graph suggests the existence of communities within the network. These communities are groups of nodes that have more connections with each other than with the rest of the network. The formation of communities can be attributed to various factors such as shared interests, common goals, and social relationships.
In the image of the network structure to the left, each community has been assigned a unique color. However, on the right image, it is only the 5 largest communities in the network that have been given a unique color, while the rest have been assigned white.
Looking at the image with the top 5 communities colored, some clusters in the network may appear separate, yet several clusters share the same color, indicating that they belong to the same community. This suggests that although these smaller clusters may not be closely connected to the larger clusters, the content or genre of the products they purchased is similar enough to link them to a particular community. For example, two separate clusters may each represent a different album, while both albums belong to the same genre.
In other words, it is believed that the color coding provides insight into how the music genres purchased by individuals can lead to the formation of communities, even among buyers who may not appear directly connected in the network - that is buyers from two different clusters being in the same community.
To explore the identified communities, the top 5 communities by number of nodes are found, and the distribution of community sizes is plotted. The average degree of each of the top 5 communities is computed, and the distribution of average degrees across communities is shown below. The average clustering coefficient is also calculated for each of the top 5 communities to investigate how well connected the nodes within each community are. Finally, the betweenness centrality of nodes within each of the top 5 communities is explored to see which nodes are most important in terms of connecting different parts of the community.
# Count the number of communities
num_communities = len(set(partition.values()))
print(f"There are {num_communities} communities.")
# Get a list of all the communities in the network
communities = list(set(partition.values()))
# Count the number of nodes in each community
community_sizes = Counter(partition.values())
# Get the 5 largest communities by node count
largest_communities = community_sizes.most_common(5)
# Find index of top 5 communities by nodes
largest_communities_index = [t[0] for t in largest_communities]
print('Top 5 communities by nodes')
# Print the number of nodes in each of the 5 largest communities
for comm_id, size in largest_communities:
    print("Community {}: {} nodes".format(comm_id, size))
# Get the 10 largest communities by node count
largest_10_communities = community_sizes.most_common(10)
# Find index of top 10 communities by nodes
largest_10_communities_index = [t[0] for t in largest_10_communities]
There are 838 communities.
Top 5 communities by nodes
Community 17: 373 nodes
Community 42: 362 nodes
Community 39: 346 nodes
Community 6: 338 nodes
Community 31: 338 nodes
# Compute the size of each community
community_sizes = np.zeros(num_communities)
for i in range(num_communities):
    community_sizes[i] = len([n for n in G.nodes() if partition[n] == i])
# Plot a histogram of community sizes
plt.hist(community_sizes, bins=50, log = True)
plt.title('Community sizes')
plt.xlabel("Community size")
plt.ylabel("Number of communities")
plt.show()
The above distribution shows that most of the communities have a size under 25 whereas some communities have a size up to around 400 nodes.
# Compute the average (weighted) degree of each community
avg_degrees = np.zeros(num_communities)
for i in range(num_communities):
    nodes_in_community = [n for n in G.nodes() if partition[n] == i]
    subgraph = G.subgraph(nodes_in_community)
    avg_degree = sum(dict(subgraph.degree(weight='weight')).values()) / len(nodes_in_community)
    avg_degrees[i] = avg_degree
# Find the average degree of each of the top 5 communities
for comm_id in largest_communities_index:
    print(f'The average degree of community {comm_id} is {avg_degrees[comm_id]}')
The average degree of community 17 is 15.484029484029485
The average degree of community 413 is 195.8997493734336
The average degree of community 39 is 25.705202312138727
The average degree of community 6 is 28.775147928994084
The average degree of community 33 is 105.57692307692308
# Plot a histogram of average degrees
plt.hist(avg_degrees, bins=50, log = True)
plt.title('Average degree for communities')
plt.xlabel("Average degree")
plt.ylabel("Number of communities")
plt.show()
The distribution above shows that most of the communities have an average degree under 20 and the highest average degree among the communities is just below 200.
# Calculate the average clustering coefficient for the top 5 communities by nodes
clustering_coeffs = []
for c in largest_communities_index:
    subgraph = G.subgraph([n for n, p in partition.items() if p == c])
    clustering_coeffs.append(nx.average_clustering(subgraph))
# Print the average clustering coefficient for each community
for i, c in enumerate(largest_communities_index):
    print(f"Community {c}:")
    print(f"  Average clustering coefficient: {clustering_coeffs[i]}")
Community 17:
Average clustering coefficient: 0.917984411530378
Community 413:
Average clustering coefficient: 0.9785924179664399
Community 39:
Average clustering coefficient: 0.9638518685152037
Community 6:
Average clustering coefficient: 0.975156002496449
Community 33:
Average clustering coefficient: 0.9861471933798358
Looking at the average clustering coefficient for each of the top 5 communities printed above, all values are very close to 1, meaning that each community is almost fully connected: nearly every node in the community is connected to almost every other node. This indicates that reviewers within a community share common interests, in this case having reviewed the same product or products as the other reviewers within the same community. The high average clustering coefficient also suggests that there are relatively few connections between nodes in different communities, which may indicate that each community is relatively isolated from the others in the network.
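As a toy reference point for interpreting these values (separate from the review network): in a triangle every pair of a node's neighbors is connected, giving an average clustering coefficient of 1, while in a star no two neighbors of the hub are connected, giving 0:

```python
import networkx as nx

# Triangle: each node's two neighbors are connected to each other
triangle = nx.complete_graph(3)
c_triangle = nx.average_clustering(triangle)  # 1.0

# Star: the hub's neighbors share no edges, and each leaf has only one neighbor
star = nx.star_graph(3)
c_star = nx.average_clustering(star)  # 0.0
```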
# Calculate betweenness centrality for nodes within each community
for c in largest_communities_index:
    print(f"Community {c}:")
    subgraph = G.subgraph([n for n, p in partition.items() if p == c])
    betweenness = nx.betweenness_centrality(subgraph, weight='weight')
    sorted_betweenness = sorted(betweenness.items(), key=lambda x: x[1], reverse=True)
    print("  Top 3 nodes by betweenness centrality:")
    for i in range(3):
        print(f"    {sorted_betweenness[i][0]}: {sorted_betweenness[i][1]}")
Community 17:
Top 3 nodes by betweenness centrality:
A1GGOC9PVDXW7Z: 0.8423645320197044
A2BH6WJF43495R: 0.25250866630176977
A18H04ULXPZ1HG: 0.18985586571793467
Community 413:
Top 3 nodes by betweenness centrality:
A2Z23CTP8M7PO9: 0.02990378718497563
A2E917SCQESVM7: 0.02542891872725737
A1IJJNWFRM7NF5: 0.02542891872725737
Community 39:
Top 3 nodes by betweenness centrality:
AG3J29DNQ4XHO: 0.2999159404240294
A2FEWIT2GI9HFF: 0.2881700691736081
AXHE1U679RIPU: 0.2114341085271318
Community 6:
Top 3 nodes by betweenness centrality:
AIF2OQYPVV1ET: 0.5003267627525787
AYATUJYQK9UFZ: 0.35438745231030094
A1RNZU3ZTT5R2F: 0.19139465875370917
Community 33:
Top 3 nodes by betweenness centrality:
A2JD032SBGIT34: 0.08059034087502635
A3K22GV3IBLVCL: 0.0748004208883094
A28RVNAKTHXKV: 0.06777009401266754
The top 3 nodes by betweenness centrality for the top 5 communities are listed above. These are the nodes that mediate most of the flow of information in their community. Most of the centrality values are relatively low, except for the top-ranked node of community 17.
Before conducting text analysis, a Pandas Series called community_tokens is created containing the tokenized reviews for each community.
The text is tokenized in the following steps:
# Store the communities in a dictionary where keys are community ids and values are the nodes belonging to the community
communities = {}
for node, comm_id in partition.items():
    if comm_id not in communities:
        communities[comm_id] = [node]
    else:
        communities[comm_id].append(node)
# Create a dictionary to store the review tokens for each community
community_reviews = {}
# Define the set of stopwords and punctuation to remove
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)
remove = stop_words.union(punctuation)
# Iterate over each community and find all the reviews written by members of the community
for community, members in communities.items():
    # Create a list to store the review tokens for each member of the community
    member_reviews = []
    for member in members:
        # Find all the reviews written by the member
        member_rows = df_filtered_Vinyl[df_filtered_Vinyl['reviewerID'] == member]
        # Create the review tokens for each review and add them to the member_reviews list
        for review in member_rows['reviewText']:
            # Tokenize, lowercase, and drop stopwords, punctuation and non-alphabetic tokens
            tokens = nltk.word_tokenize(review)
            tokens_new = [token.lower() for token in tokens if (token.lower() not in remove) and (token.lower().isalpha())]
            tokens_joined = ' '.join(tokens_new)
            member_reviews.append(tokens_joined)
    member_reviews_j = ' '.join(member_reviews)
    # Add the joined member reviews to the community_reviews dictionary
    community_reviews[community] = member_reviews_j
# Convert the dictionary to a Pandas Series and sort by index
community_tokens = pd.Series(community_reviews).sort_index()
Then a subset of community_tokens is constructed containing only the tokens of the 10 largest communities. In addition, a list containing all tokens from the 10 largest communities is created.
largest_10_communities_tokens = community_tokens.loc[largest_10_communities_index]
# Filter the dictionary down to the 10 largest communities
largest_10_communities_tokens_dict = {key: value for key, value in community_reviews.items() if key in largest_10_communities_index}
# Collect the reviews from all top 10 communities into one list [com1, com2, com3, ...]
all_list = []
for value in largest_10_communities_tokens_dict.values():
    all_list.append(value)
Word clouds are created for the top 10 communities using TF-IDF, where words with a high TF-IDF score appear more prominently in the word cloud than words with a low score. For the TfidfVectorizer, max_df was set to 0.8, as the words filtered out at 0.7 seemed to carry significant information about the genre of the music in question. A higher threshold was not chosen because the additional words did not add anything to the later analysis of the word clouds. The max_df parameter limits how widespread a word may be across the communities: if a word appears in more than 80% of all communities, it is excluded from the vocabulary.
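The cut-off behaviour of max_df can be illustrated on a toy corpus (the three document strings below are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy "community" documents; 'vinyl' and 'record' appear in all of them
docs = [
    "vinyl record jazz trumpet",
    "vinyl record rock guitar",
    "vinyl record funk bass",
]
# max_df=0.8: terms with a document frequency above 80% are dropped entirely
vec = TfidfVectorizer(max_df=0.8).fit(docs)
print(sorted(vec.vocabulary_))  # 'vinyl' and 'record' are gone
```

Dropping such ubiquitous terms leaves only the genre-bearing words, which is what makes the word clouds below interpretable.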
# Fit the TfidfVectorizer object to the data
tfidf = TfidfVectorizer(use_idf=True, max_df=0.8).fit(all_list)
# Word cloud function
def word_clouds(text):
    # Transform the data into a TF-IDF matrix
    tfidf_matrix = tfidf.transform([text])
    # Convert the matrix into a Pandas DataFrame
    df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
    # Generate the word cloud from the TF-IDF scores
    wordcloud = WordCloud(background_color="white", width=800, height=400, max_words=200).generate_from_frequencies(df.iloc[0])
    # Display the word cloud
    plt.figure(figsize=(10, 8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
for key, value in largest_10_communities_tokens_dict.items():
    print(f'Community: {key}')
    word_clouds(value)
Community: 3
Community: 6
Community: 23
Community: 30
Community: 44
Community: 33
Community: 41
Community: 49
Community: 234
Community: 416
The word cloud seems to be referring to the jazz genre. It contains words related to music charts, orchestras, piano, saxophone, trumpet, and various jazz musicians such as John Mulligan, Dave Brubeck, George Benson, and Les Brown. The inclusion of "Beatles" may suggest some jazz-inspired or jazz-influenced music by the Beatles, such as their album "Sgt. Pepper's Lonely Hearts Club Band" which has elements of jazz. However, overall, the emphasis seems to be on jazz music and its various subgenres.
Based on the word cloud, it is difficult to determine a specific genre of music that it might be referring to. The mention of the band "Twenty-One Pilots" and their albums "Vessel" and "Blurryface" suggests that the genre might be alternative or pop rock. The words "meaning" and "opinions" also suggest that the music may have a deeper or more introspective lyrical content. Overall, it seems to be referencing modern music that may have a strong online presence and fanbase.
Based on the words in the word cloud, it appears to be referring to various artists and bands from different genres of music. The inclusion of artists such as Stevie Nicks, Bob Dylan, and Roger Waters suggests a focus on classic rock, while the mention of demos and alternate versions of songs suggests a potential interest in rarities and unreleased material. Additionally, the presence of newer artists such as Mumford & Sons and Regina Spektor suggests a potential interest in contemporary indie/folk music. Overall, it seems like this word cloud is referencing a broad range of musical styles and eras, rather than any specific genre.
The word cloud seems to be referring to the genre of alternative rock and metal, with bands such as Deftones, Radiohead, and Megadeth included. There are also references to psychedelic music and experimental rock, with mentions of Les Claypool and Anton Newcombe. The inclusion of artists like Cyndi Lauper and Sean Lennon suggests that the word cloud may also be referencing alternative or indie pop. Overall, it seems to be a diverse range of alternative and experimental music genres.
Based on the words in the word cloud, it seems to refer to the genre of alternative or indie pop/rock. The inclusion of artists such as CHVRCHES, Lana Del Rey, The XX, and Katy Perry suggests a focus on more modern and popular acts in the genre, while the references to anime films like "Spirited Away" and "Princess Mononoke" as well as words like "retro" and "pixies" suggest a possible interest in older or more niche works within the genre. Overall, the word cloud seems to suggest a focus on dreamy, atmospheric, and perhaps introspective music.
Based on the words in the word cloud, it is difficult to determine a specific genre of music that it may be referring to. However, there are some notable bands and artists mentioned such as Counting Crows, The Beatles, and The Decemberists, as well as some references to percussion and jazz. Therefore, it could be possible that this word cloud is related to alternative rock or indie music with elements of jazz or experimental percussion.
Based on the words in the word cloud, it seems likely that the genre of music being referred to is heavy metal or thrash metal. Words such as "Metallica", "metal", "paul", "petty", "McCartney", "solo", "bonnie", "lightning", "Sinatra", "Megaforce", "man", "thrash", "ride", "puppets", "blackened", and "kill" all have associations with heavy metal or thrash metal music. Other words such as "soul", "medley", and "stranger" could also suggest a fusion of metal with other genres. Overall, the presence of words such as "Metallica", "thrash", "puppets", and "blackened" seem to strongly suggest heavy metal or thrash metal as the genre being referred to.
The word cloud seems to be referring to the funk and soul music genres, particularly featuring artists such as James Brown, Fred Wesley, and other musicians who were influential in the development of funk music. The presence of words like "funk," "soul," "trumpet," "sax," "organ," and "electric" suggests that this word cloud is highlighting various aspects of funk and soul music, including the use of brass and keyboard instruments, as well as the emphasis on groove and rhythm. Additionally, the appearance of "disco" and "psychedelic" may suggest that the word cloud is also referencing some of the cross-genre experimentation that occurred during the heyday of funk and soul music in the 1970s.
Based on the words in the word cloud, it seems like the genre of music being referred to is progressive rock. The presence of the band name "Pink Floyd" and the names of its members such as "Gilmour", "Wright", "Waters", "Mason", and "Richard" suggests that the word cloud is related to this band's music. Other words such as "instrumental", "ambient", "lapse", "final", and "endless" also point to a progressive rock genre, which often features longer, instrumental passages and ambient soundscapes. Additionally, the presence of other artists such as "Robert Browne", "Moon", and "Cash" might suggest a wider interest in classic rock or related genres.
Based on the words in the word cloud, it seems that it is referring to the music industry in general, rather than a specific genre of music. The words relate to the Billboard charts and various record labels, such as RCA, Capitol, and Columbia. The presence of terms like "orchestra," "solo," and "vocalist" suggests that the word cloud is related to popular music that may feature various instrumentation, vocal performances, and collaborations.
To determine whether there is a difference between the reviews for the different genres, a sentiment analysis is performed on the top 10 communities using the VADER library.
# Create a dictionary to store the raw reviews for each community
community_reviews = {}
# Iterate over each community and find all the reviews written by members of the community
for community, members in communities.items():
    # Create a list to store the reviews for each member of the community
    member_reviews = []
    for member in members:
        # Find all the reviews written by the member
        member_rows = df_filtered_Vinyl[df_filtered_Vinyl['reviewerID'] == member]
        # Add each review to the member_reviews list
        for review in member_rows['reviewText']:
            member_reviews.append(str(review))
    # Join the member reviews and add them to the community_reviews dictionary
    member_reviews_joined = ' '.join(member_reviews)
    community_reviews[community] = member_reviews_joined
sentiment = SentimentIntensityAnalyzer()
for community in largest_10_communities_index:
    sent = sentiment.polarity_scores(community_reviews[community])
    print(f'Community {community}: {sent}')
Output from the cell above:
Community 42: {'neg': 0.048, 'neu': 0.718, 'pos': 0.234, 'compound': 1.0}
Community 412: {'neg': 0.047, 'neu': 0.608, 'pos': 0.345, 'compound': 1.0}
Community 75: {'neg': 0.097, 'neu': 0.716, 'pos': 0.187, 'compound': 1.0}
Community 31: {'neg': 0.047, 'neu': 0.77, 'pos': 0.183, 'compound': 1.0}
Community 228: {'neg': 0.061, 'neu': 0.638, 'pos': 0.301, 'compound': 1.0}
Community 33: {'neg': 0.061, 'neu': 0.716, 'pos': 0.223, 'compound': 1.0}
Community 6: {'neg': 0.069, 'neu': 0.678, 'pos': 0.254, 'compound': 1.0}
Community 39: {'neg': 0.051, 'neu': 0.707, 'pos': 0.243, 'compound': 1.0}
Community 3: {'neg': 0.058, 'neu': 0.74, 'pos': 0.202, 'compound': 1.0}
Community 30: {'neg': 0.031, 'neu': 0.826, 'pos': 0.143, 'compound': 1.0}
Going through the different genres of the communities found in the word cloud analysis, it can be seen that some genres, such as rock, are more present in the vinyl space than others. This raises the question of how buyers of the products feel about their chosen genre.
Should a lot of negativity be expected around a genre that is less common than others, or will the sentiment analysis reveal that even niche genres have a positive reception among vinyl collectors? It is also worth noting that the presence of certain artists in the word clouds may indicate that they have a significant following within the vinyl community, regardless of genre.
While the neutrality scores are the highest, this is not of great importance, as they only measure the share of words in the reviews with a neutral sentiment: most reviewers also use words loaded with neither positive nor negative sentiment to describe their experience of listening to or buying a product. Note also that the compound scores are all saturated at 1.0 because they are computed over the entire concatenated review text of each community; for texts this long the compound score is uninformative, and the pos/neg/neu proportions are the meaningful quantities.
Additionally, it can be seen that, even across genres, there is very little negativity in the reviews and comments that buyers of the products leave. A clear positive sentiment is present in the majority of the communities analyzed, with most communities having a positive sentiment score above 0.2.
This suggests that the buyers of the products generally have a positive perception of the music genres represented in the communities.
The results therefore suggest that the products are generally well received among the buyers and that they hold a positive opinion of the music genres represented, although the community identified as "mainstream pop" has a lower positivity score than the rest. The results of the sentiment analysis make sense, as most people only buy music that they enjoy listening to.
The network constructed from the subset of vinyl records has a distinctive structure with densely connected groups of nodes. The strong community structure in the network is also reflected in its weighted modularity of 0.9. Looking at the network structure, it was observed that some of the clusters within the network belong to the same community. It is believed that clusters belonging to the same community represent different albums or artists within the same genre. However, due to the limitations of the data, product names cannot be looked up since only the product ID is available, so this interpretation of the network structure cannot be verified.
Another interesting aspect to examine from a network science perspective is how the properties of the network compare to a null model. This could be done by creating a random network with the same number of nodes as the review network and linking every possible pair of nodes with a probability p chosen so that the expected number of edges in the random network equals the number of edges in the review network.
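The proposed comparison can be sketched as follows; n and m below are placeholder counts chosen for illustration, not the review network's actual sizes:

```python
import networkx as nx

n, m = 100, 250                # placeholder node/edge counts for illustration
p = 2 * m / (n * (n - 1))      # G(n, p) then has m edges in expectation
random_G = nx.gnp_random_graph(n, p, seed=42)

# The random graph roughly matches the edge count, but its average
# clustering is about p, far below the review network's ~0.99
print(nx.number_of_edges(random_G))
print(nx.average_clustering(random_G))
```

A clustering coefficient far above the random baseline would quantify just how unusually clique-like the review network is.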
Furthermore, the word clouds show that in most cases it is possible to identify a genre for each of the communities using TF-IDF. The identified genres also support the hypothesis stated before, that most vinyl record collectors within the chosen data set are most interested in the rock genre, although this interpretation cannot be verified due to the limitations of the data.
By conducting sentiment analysis on reviews from the top 10 communities, it is found that reviewers generally have a positive perception across all communities.
Finally, it could be interesting to group the reviews into 5 groups by their rating, one for each rating integer. By investigating the word clouds for each group, other relations within the data may be found. Additionally, it could be explored whether there is a relation between the sentiment of the reviews from the different groups and their rating.
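A hedged sketch of this follow-up, assuming the Amazon data's 'overall' star-rating column alongside 'reviewText' (the tiny data frame below is invented and stands in for df_filtered_Vinyl):

```python
import pandas as pd

# Invented mini data frame standing in for df_filtered_Vinyl
df = pd.DataFrame({
    "overall": [5, 4, 5, 1, 3],
    "reviewText": ["love it", "great pressing", "classic", "warped disc", "meh"],
})
# One joined text blob per integer rating; each blob could then be fed to
# the same TF-IDF / word-cloud and VADER pipeline used for the communities
grouped = df.groupby("overall")["reviewText"].apply(" ".join)
print(grouped.loc[5])  # "love it classic"
```

The resulting five documents take the place of the community documents, so the rest of the analysis carries over unchanged.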